NSF PAR Search | NSF Public Access Repository

Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression

https://doi.org/10.1145/3695053.3731024

Cheng, Feng; Guo, Cong; Wei, Chiyue; Zhang, Junyao; Zhou, Changchun; Hanson, Edward; Zhang, Jiaqi; Liu, Xiaoxiao; Li, Hai; Chen, Yiran (June 2025, ACM)

Full Text Available
NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models

https://doi.org/10.1109/TC.2024.3365939

Li, Shiyu; Wang, Yitu; Hanson, Edward; Chang, Andrew; Seok_Ki, Yang; Li, Hai; Chen, Yiran (May 2024, IEEE Transactions on Computers)

Full Text Available
Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators

https://doi.org/10.1109/TCAD.2024.3409193

Wu, Xueying; Hanson, Edward; Wang, Nansu; Zheng, Qilin; Yang, Xiaoxuan; Yang, Huanrui; Li, Shiyu; Cheng, Feng; Pande, Partha Pratim; Doppa, Janardhan Rao; et al. (June 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
DefT: Boosting Scalability of Deformable Convolution Operations on GPUs

https://doi.org/10.1145/3582016.3582017

Hanson, Edward; Horton, Mark; Li, Hai; Chen, Yiran (March 2023, The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS ’23))

Deformable Convolutional Networks (DCN) have been proposed as a powerful tool to boost the representation power of Convolutional Neural Networks (CNN) in computer vision tasks via adaptive sampling of the input feature map. Much like vision transformers, DCNs utilize a more flexible inductive bias than standard CNNs and have also been shown to improve performance of particular models. For example, drop-in DCN layers were shown to increase the AP score of Mask RCNN by 10.6 points while introducing only 1% additional parameters and FLOPs, improving the state-of-the-art model at the time of publication. However, despite evidence that more DCN layers placed earlier in the network can further improve performance, we have not seen this trend continue with further scaling of deformations in CNNs, unlike for vision transformers. Benchmarking experiments show that a realistically sized DCN layer (64H×64W, 64 in-out channel) incurs a 4× slowdown on a GPU platform, discouraging the more ubiquitous use of deformations in CNNs. These slowdowns are caused by the irregular input-dependent access patterns of the bilinear interpolation operator, which has a disproportionately low arithmetic intensity (AI) compared to the rest of the DCN. To address the disproportionate slowdown of DCNs and enable their expanded use in CNNs, we propose DefT, a series of workload-aware optimizations for DCN kernels. DefT identifies performance bottlenecks in DCNs and fuses specific operators that are observed to limit DCN AI. Our approach also uses statistical information of DCN workloads to adapt the workload tiling to the DCN layer dimensions, minimizing costly out-of-boundary input accesses. Experimental results show that DefT mitigates up to half of DCN slowdown over the current-art PyTorch implementation. This translates to a layerwise speedup of up to 134% and a reduction of normalized training time of 46% on a fully DCN-enabled ResNet model.
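To make the bilinear interpolation operator mentioned in this abstract concrete, the following is a minimal single-channel sketch in plain Python. It is not DefT and not the PyTorch kernel; the function names `bilinear_sample` and `deformable_tap`, the zero-padding convention for out-of-boundary reads, and the 3×3 kernel shape are all assumptions made for illustration.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional location (y, x).

    Reads that fall outside the map return 0.0 (an assumed convention).
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    wy, wx = y - y0, x - x0  # fractional distances to the top-left neighbor

    def at(r, c):
        return feature[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    # Four data-dependent reads, but only a handful of multiply-adds.
    top = (1 - wx) * at(y0, x0) + wx * at(y0, x0 + 1)
    bottom = (1 - wx) * at(y0 + 1, x0) + wx * at(y0 + 1, x0 + 1)
    return (1 - wy) * top + wy * bottom


def deformable_tap(feature, y, x, offsets, weights):
    """One output value of a 3x3 deformable convolution centered at (y, x).

    offsets: nine (dy, dx) pairs, one learned shift per kernel tap.
    weights: nine kernel weights in row-major order.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    total = 0.0
    for (dy, dx), (oy, ox), wgt in zip(grid, offsets, weights):
        total += wgt * bilinear_sample(feature, y + dy + oy, x + dx + ox)
    return total
```

With all offsets set to zero, `deformable_tap` reduces to an ordinary 3×3 convolution tap. Because each sample location depends on the offsets, the four neighbor reads per tap cannot be predicted ahead of time, which is one way to picture the irregular access pattern and low arithmetic intensity the abstract describes.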
Full Text Available
DyNNamic: Dynamically Reshaping, High Data-Reuse Accelerator for Compact DNNs

https://doi.org/10.1109/TC.2022.3184272

Hanson, Edward; Li, Shiyu; Qian, Xuehai; Li, Hai Helen; Chen, Yiran (March 2023, IEEE Transactions on Computers)

Full Text Available
Cascading structured pruning: enabling high data reuse for sparse DNN accelerators

https://doi.org/10.1145/3470496.3527419

Hanson, Edward; Li, Shiyu; Li, Hai 'Helen'; Chen, Yiran (June 2022, International Symposium on Computer Architecture (ISCA))

Full Text Available
An Efficient 3D ReRAM Convolution Processor Design for Binarized Weight Networks

https://doi.org/10.1109/TCSII.2021.3067840

Kim, Bokyung; Hanson, Edward; Li, Hai (May 2021, IEEE Transactions on Circuits and Systems II: Express Briefs)
Full Text Available
ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition

https://doi.org/10.1145/3466752.3480043

Li, Shiyu; Hanson, Edward; Qian, Xuehai; Li, Hai "Helen"; Chen, Yiran (October 2021, IEEE/ACM International Symposium on Microarchitecture)
The ever-growing parameter size and computation cost of Convolutional Neural Network (CNN) models hinder their deployment onto resource-constrained platforms. Network pruning techniques have been proposed to remove the redundancy in CNN parameters and produce a sparse model. Sparse-aware accelerators have also been proposed to reduce the computation cost and memory bandwidth requirements of inference by leveraging the model sparsity. The irregularity of sparse patterns, however, limits the efficiency of those designs. Researchers have proposed to address this issue by creating a regular sparsity pattern through hardware-aware pruning algorithms. However, the pruning rate of these solutions is largely limited by the enforced sparsity patterns. This limitation motivates us to explore other compression methods beyond pruning. We found that kernel decomposition, with its two decoupled computation stages, could potentially take the processing of the sparse pattern off the critical path of inference and achieve a high compression ratio without enforcing the sparse patterns. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map. We propose a hybrid quantization scheme to exploit the different reuse frequency of each part of the decomposed weight. At the architecture level, ESCALATE proposes a novel ‘Basis-First’ dataflow and its corresponding microarchitecture design to maximize the benefits brought by the decomposed convolution.
Full Text Available
PENNI: Pruned Kernel Sharing for Efficient CNN Inference

Li, Shiyu; Hanson, Edward; Li, Hai; Chen, Yiran (July 2020, International Conference on Machine Learning)

Although state-of-the-art (SOTA) CNNs achieve outstanding performance on various tasks, their high computation demand and massive number of parameters make it difficult to deploy these SOTA CNNs onto resource-constrained devices. Previous works on CNN acceleration utilize low-rank approximation of the original convolution layers to reduce computation cost. However, these methods are very difficult to conduct upon sparse models, which limits execution speedup since redundancies within the CNN model are not fully exploited. We argue that kernel granularity decomposition can be conducted under a low-rank assumption while exploiting the redundancy within the remaining compact coefficients. Based on this observation, we propose PENNI, a CNN model compression framework that is able to achieve model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting bases and coefficients with sparse constraints. Experiments show that we can prune 97% of parameters and 92% of FLOPs on ResNet18 on CIFAR10 with no accuracy loss, and achieve a 44% reduction in run-time memory consumption and a 53% reduction in inference latency.
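The kernel-sharing idea in this abstract can be illustrated with a small plain-Python sketch. It shows only the general identity behind basis-kernel decomposition (convolution is linear in the kernel), not the PENNI training procedure; the helper names `conv2d` and `combine`, the single-channel setting, and the example values are assumptions chosen for illustration.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation on lists of rows."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(ow)
        ]
        for i in range(oh)
    ]


def combine(arrays, coeffs):
    """Weighted sum of equally sized 2D arrays, skipping zero coefficients."""
    rows, cols = len(arrays[0]), len(arrays[0][0])
    return [
        [sum(c * arr[i][j] for c, arr in zip(coeffs, arrays) if c != 0)
         for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical example data: two shared 3x3 basis kernels and one
# coefficient vector describing a single kernel of the layer.
image = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 1, 0, 1], [1, 0, 2, 3]]
bases = [
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
]
coeffs = [2, -1]

# Direct path: rebuild the full kernel, then convolve.
direct = conv2d(image, combine(bases, coeffs))

# Shared path: convolve once per basis kernel, then mix the feature maps.
basis_maps = [conv2d(image, b) for b in bases]
shared = combine(basis_maps, coeffs)

assert direct == shared
```

In the shared path, the per-basis feature maps can be reused by every kernel that is expressed in the same bases, and any coefficient that is exactly zero is skipped in `combine`, which is where sparsity in the coefficients would translate into saved work.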
Full Text Available
